{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e136a5a2-64bf-48dc-a87b-dbf47ae8f415",
   "metadata": {},
   "source": [
    "# 快速入门 语音识别模型 Whisper \n",
    "\n",
    "OpenAI 提供了两个基于开源的 Whisper large-v2 模型的语音到文本API服务：\n",
    "- 转录（transcriptions）：将音频转录为音频所使用的任何语言。\n",
    "- 翻译（translations）：将音频翻译并转录为英语\n",
    "\n",
    "\n",
    "目前文件上传限制为 25 MB，支持以下输入文件类型：`mp3、mp4、mpeg、mpga、m4a、wav 和 webm`。\n",
    "\n",
    "\n",
    "## 语音转录 Transcription API\n",
    "\n",
    "输入音频文件，返回转录对象（JSON）\n",
    "\n",
    "参数\n",
    "- **file**（文件）：需要转录的音频文件对象（不是文件名），支持以下格式：flac、mp3、mp4、mpeg、mpga、m4a、ogg、wav 或 webm。\n",
    "- **model**（'whisper-1'）：使用的模型 ID。目前仅可使用由我们的开源 Whisper V2 模型驱动的 whisper-1。\n",
    "- **language**（语言，可选）：输入音频的语言。提供 ISO-639-1 格式的输入语言可以提高准确性和响应速度。\n",
    "- **prompt**（提示，可选）：可选文本，用于指导模型的风格或继续前一个音频片段。提示应与音频语言相匹配。\n",
    "- **response_format**（响应格式，可选）：转录输出的格式，默认为 json。可选的格式有：json、text、srt、verbose_json 或 vtt。\n",
    "- **temperature**（温度，可选）：采样温度，范围从 0 到 1。更高的值，如 0.8，将使输出更随机，而更低的值，如 0.2，将使输出更集中和确定。如果设置为 0，模型将使用对数概率自动提高温度，直到达到某些阈值。\n",
    "- **timestamp_granularities[]**（时间戳粒度，可选）：为此转录填充的时间戳粒度，默认为 segment。响应格式必须设置为 verbose_json 才能使用时间戳粒度。支持以下一个或两个选项：word 或 segment。注意：segment 时间戳不增加额外延迟，但生成 word 时间戳会增加额外延迟。\n",
    "\n",
    "返回值\n",
    "- 转录对象（Transcription Object）或详细转录对象（Verbose Transcription Object）。\n",
    "\n",
    "### 转录对象\n",
    "\n",
    "**Transcription Object**:\n",
    "```json\n",
    "{\n",
    "  \"text\": \"Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.\"\n",
    "}\n",
    "```\n",
    "\n",
    "**Verbose Transcription Object**:\n",
    "```json\n",
    "{\n",
    "  \"task\": \"transcribe\",\n",
    "  \"language\": \"english\",\n",
    "  \"duration\": 8.470000267028809,\n",
    "  \"text\": \"The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.\",\n",
    "  \"segments\": [\n",
    "    {\n",
    "      \"id\": 0,\n",
    "      \"seek\": 0,\n",
    "      \"start\": 0.0,\n",
    "      \"end\": 3.319999933242798,\n",
    "      \"text\": \" The beach was a popular spot on a hot summer day.\",\n",
    "      \"tokens\": [\n",
    "        50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530\n",
    "      ],\n",
    "      \"temperature\": 0.0,\n",
    "      \"avg_logprob\": -0.2860786020755768,\n",
    "      \"compression_ratio\": 1.2363636493682861,\n",
    "      \"no_speech_prob\": 0.00985979475080967\n",
    "    },\n",
    "    ...\n",
    "  ]\n",
    "}\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3150141f-f246-44e1-b6fe-43387d953783",
   "metadata": {},
   "source": [
    "### 使用 Whisper 实现中文转录\n",
    "\n",
    "将 TTS 配音的李云龙台词音频文件(liyunlong.mp3)发送给 Whisper 模型进行中文转录"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "84fcd642-213d-45b2-86c1-042657f5baa4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "二营长,你他娘的意大利泡呢?给我拉来!\n"
     ]
    }
   ],
   "source": [
    "from openai import OpenAI\n",
    "client = OpenAI()\n",
    "\n",
    "audio_file= open(\"./audio/liyunlong.mp3\", \"rb\")\n",
    "\n",
    "transcription = client.audio.transcriptions.create(\n",
    "  model=\"whisper-1\", \n",
    "  file=audio_file\n",
    ")\n",
    "\n",
    "print(transcription.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6388a91-b426-44f1-bd1e-89a17b744b2f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "94fa82fb-b9f4-478e-af4e-97744326ba22",
   "metadata": {},
   "source": [
    "## 语音翻译 API\n",
    "\n",
    "输入音频文件，返回翻译文本\n",
    "\n",
    "请求体\n",
    "- **file**（文件）：需要翻译的音频文件对象（不是文件名），支持以下格式：flac、mp3、mp4、mpeg、mpga、m4a、ogg、wav 或 webm。\n",
    "- **model**（'whisper-1'）：使用的模型 ID。目前只有由我们的开源 Whisper V2 模型驱动的 whisper-1 可用。\n",
    "- **prompt**（提示，可选）：可选文本，用于指导模型的风格或继续前一个音频片段。提示应为英文。\n",
    "- **response_format**（响应格式，可选）：转录输出的格式，默认为 json。可选的格式包括：json、text、srt、verbose_json 或 vtt。\n",
    "- **temperature**（温度，可选）：采样温度，范围从 0 到 1。较高的值，如 0.8，将使输出更随机，而较低的值，如 0.2，将使输出更集中和确定。如果设置为 0，模型将使用对数概率自动增加温度，直到达到特定阈值。\n",
    "\n",
    "返回值\n",
    "- **translated_text**: 翻译后的文本。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0cd53e0-c069-4a6c-964d-c2d348781433",
   "metadata": {},
   "source": [
    "### 使用 Whisper 实现中文识别+翻译\n",
    "\n",
    "将 TTS 配音的李云龙台词音频文件(liyunlong.mp3)发送给 Whisper 模型进行翻译"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cf456da9-56b8-4795-bf26-2379e1772c20",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Second Battalion Commander, where is your Italian gun? Bring it to me.\n"
     ]
    }
   ],
   "source": [
    "audio_file= open(\"./audio/liyunlong.mp3\", \"rb\")\n",
    "\n",
    "translation = client.audio.translations.create(\n",
    "    model=\"whisper-1\", \n",
    "    file=audio_file,\n",
    "    prompt=\"Translate into English\",\n",
    ")\n",
    "\n",
    "print(translation.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c25de24-8cdb-4335-9028-fa7eb35dd6be",
   "metadata": {},
   "source": [
    "### 使用 TTS 给李云龙英文版台词配音"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "055a0016-9b49-4105-b64d-4ae3c4a543ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "speech_file_path = \"./audio/liyunlong_en.mp3\"\n",
    "\n",
    "with client.audio.speech.with_streaming_response.create(\n",
    "    model=\"tts-1\",\n",
    "    voice=\"onyx\",\n",
    "    input=translation.text\n",
    ") as response:\n",
    "    response.stream_to_file(speech_file_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "004b6daa-6df7-42f3-93dd-ef4ce8d04ca9",
   "metadata": {},
   "source": [
    "### 使用 Whipser + TTS 生成郭德纲相声英文版"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "32bd206c-419f-4350-9eac-360b05c3a529",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are a lot of people here. Thank you for coming. Today is the Xiangsheng Conference. Today, He Yunwei and I, the two of us, have a special feature. We don't have much ability, but we give a lot. There are not many opportunities to cooperate with Mr. Xing. Give a little. This is also an old senior in our Xiangsheng world. Mr. Xing Wenzhao has been in the Xiangsheng circle for more than 50 years. That's right. We are going to hold a commemorative exhibition this year. Mr. Xing Wenzhao's third anniversary. 53rd anniversary. That's the 40th anniversary. Oh my god.\n"
     ]
    }
   ],
   "source": [
    "gdg_audio_file = open(\"./audio/gdg.mp3\", \"rb\")\n",
    "gdg_speech_file = \"./audio/gdg_en.mp3\"\n",
    "\n",
    "translation = client.audio.translations.create(\n",
    "  model=\"whisper-1\", \n",
    "  file=gdg_audio_file\n",
    ")\n",
    "\n",
    "print(translation.text)\n",
    "\n",
    "with client.audio.speech.with_streaming_response.create(\n",
    "    model=\"tts-1\",\n",
    "    voice=\"onyx\",\n",
    "    input=translation.text\n",
    ") as response:\n",
    "    response.stream_to_file(gdg_speech_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae579e8e-a197-426c-86f7-5ba0d1659ea0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
